adding gujarati vocabulry dec 4 #1811

Closed
wants to merge 7 commits into from

sarjil77
Contributor

@sarjil77 sarjil77 commented Dec 3, 2024

here i am adding the gujarati vocabulary

@@ -22,6 +22,13 @@
"hindi_letters": "अआइईउऊऋॠऌॡएऐओऔअंअःकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह",
"hindi_digits": "०१२३४५६७८९",
"hindi_punctuation": "।,?!:्ॐ॰॥॰",
"gujarati_vowels": "અઆઇઈઉઊઋએઐઓઔઅંઅઃ ",
"gujarati_digits":"૦૧૨૩૪૫૬૭૮૯",
"gujarati_diacritics_consonants":"""કકાકિકીકુકૂકૃકેકૈકોકૌકંકઃખખાખિખીખુખૂખૃખેખૈખોખૌખંખઃગગાગિગીગુગૂગૃગેગૈગોગૌગંગઃઘઘાઘિઘીઘુઘૂઘૃઘેઘૈઘોઘૌઘંઘઃઙઙાઙિઙીઙુઙૂઙૃઙેઙૈઙોઙૌઙંઙઃચચાચિચીચુચૂચૃચેચૈચોચૌચંચઃછછાછિછીછુછૂછૃછેછૈછોછૌછંછઃ
Contributor

@sarjil77 I tested a bit with your added vocab, and it raised an issue: we can't encode it. If I understood it correctly, a leading letter combined with a dotted-circle diacritic (for example: કૌ) is rendered as one character but is programmatically counted as 2 characters. Is there any way to make these strings Unicode-conformant?

So that, in the end, each character in an image corresponds to exactly 1 encoded character.
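
To make the mismatch concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The sample string is an assumption (KA followed by VOWEL SIGN AU, written with escapes so the code points are explicit); it is not taken from the PR's code.

```python
import unicodedata

# KA + VOWEL SIGN AU: rendered as one glyph but stored as two code points.
syllable = "\u0a95\u0acc"

# Print each code point with its Unicode name and general category.
for ch in syllable:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# len() counts code points, not rendered glyphs.
num_code_points = len(syllable)
print(num_code_points)
```

An encoder that works code point by code point will therefore see the vowel sign as its own symbol.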

if i filter your diacritics i get the following:

ઃકખગઘઙચછજઝઞટઠડઢણતથદધનપફબભમયરલવશષાિીુૂૃેૈોૌ્

Btw, in a multi-line string each line needs to end with \, otherwise the line break is counted as part of the string.
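
The line-break remark can be checked with a small sketch. The ASCII placeholder strings are assumptions chosen for readability, not the PR's vocab.

```python
# Without a trailing backslash, the line break becomes part of the string.
with_newline = """abc
def"""

# A trailing backslash joins the two lines, so no "\n" ends up in the string.
without_newline = """abc\
def"""

print(repr(with_newline))
print(repr(without_newline))
```

A stray "\n" in a vocab string would otherwise be treated as one more symbol of the vocabulary.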

Contributor

@sarjil77 Something like this:

"gujarati_letters": "તગખઢરજયશઆઐઊૂેપફુ્ઓૈાથીડૃદઠવનલષકિઅભઘઉઔઝઙઇઞઈધૌછટચબોમએણઋ",
"gujarati_digits":"૦૧૨૩૪૫૬૭૮૯",
"gujarati_punctuation": "૰ઽ◌ંઃ॥ૐ" + "૱",

length: 103
all chars: તગખઢરજયશઆઐઊૂેપફુ્ઓૈાથીડૃદઠવનલષકિઅભઘઉઔઝઙઇઞઈધૌછટચબોમએણઋ૦૧૨૩૪૫૬૭૮૯૰ઽ◌ંઃ॥ૐ૱!"#$%&'()*+,-./:;<=>?@[]^_`{|}~

? Not sure anyway 😅

This is what I get if I deduplicate it in Python.

The single diacritics (added to a char) are counted as standalone symbols.
Screenshot from 2024-12-04 10-49-46

Contributor Author

Thanks @felixdittrich92, noted. I am not sure right now, but I will look further into this.
:)

Contributor Author

Hello @felixdittrich92, you are right: a sequence like "ફુ્" (a consonant with diacritics) is counted as several characters and takes 9 bytes when encoded. To treat a letter and its diacritics as a single character, we could try NFC (Normalization Form C), which combines a character with its diacritics into a single code point where a precomposed form exists, and otherwise does not change the actual encoding or byte representation.

For example:
import unicodedata

txt = "ફુ્"

# UTF-8 bytes of the original string
encoded_string = txt.encode()

# NFC composes base characters and combining marks where a precomposed form exists
normalized_text = unicodedata.normalize("NFC", txt)

print("encoded string is:", encoded_string)
print(f"the length of the encoded string is: {len(encoded_string)}")
print("normalized_text is:", normalized_text)
print(f"the length of the normalized string is: {len(normalized_text)}")

Output:
encoded string is: b'\xe0\xaa\xab\xe0\xab\x81\xe0\xab\x8d'
the length of the encoded string is: 9
normalized_text is: ફુ્
the length of the normalized string is: 3
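
One caveat worth checking: NFC only composes sequences for which Unicode defines a precomposed code point. As far as I am aware, the Gujarati block defines none, which would explain the length of 3 in the output above. The sketch below contrasts a Latin sequence that does compose with the Gujarati sequence from the example; the Latin string and the variable names are assumptions added for illustration.

```python
import unicodedata

# Latin "e" + COMBINING ACUTE ACCENT has a precomposed form (U+00E9).
latin = "e\u0301"
# PHA + VOWEL SIGN U + VIRAMA: the same bytes as in the example output above.
gujarati = "\u0aab\u0ac1\u0acd"

latin_nfc = unicodedata.normalize("NFC", latin)
gujarati_nfc = unicodedata.normalize("NFC", gujarati)

# Code point counts before and after normalization.
print(len(latin), len(latin_nfc))
print(len(gujarati), len(gujarati_nfc))
# UTF-8 byte length of the Gujarati sequence.
print(len(gujarati.encode("utf-8")))
```

If this holds, normalization alone will not make one rendered syllable equal one encoded symbol; listing the consonants and the dependent signs as separate vocab entries is the simpler route.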

Please have a look at this. I do not know how other people have added diacritics; we could also add just the consonants and vowels here, but that would not make any sense.

Let me know your thoughts.

@felixdittrich92
Contributor

@sarjil77 Take a look here; it should be enough to copy-paste these changes: main...felixdittrich92:doctr:gujarati-vocab-test

Tested with your "full" vocab, I can completely reproduce it, so all chars are added and a string can be encoded char by char :)
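
As a rough illustration of what "encoded char by char" means, here is a minimal sketch. `encode_chars` and `toy_vocab` are hypothetical names for this example and are not docTR's own API; the toy vocab is an assumed four-symbol subset.

```python
def encode_chars(text: str, vocab: str) -> list[int]:
    """Map each code point in `text` to its index in `vocab` (hypothetical helper)."""
    index = {ch: i for i, ch in enumerate(vocab)}
    missing = sorted({ch for ch in text if ch not in index})
    if missing:
        raise ValueError(f"characters not in vocab: {missing}")
    return [index[ch] for ch in text]


# Toy vocab: KA, KHA, VOWEL SIGN AA, VOWEL SIGN AU (assumed subset, not the PR's vocab).
toy_vocab = "\u0a95\u0a96\u0abe\u0acc"

# KA + VOWEL SIGN AU + KHA is encoded as three separate indices.
encoded = encode_chars("\u0a95\u0acc\u0a96", toy_vocab)
print(encoded)
```

This only works when every code point of the text, including each dependent sign, has its own entry in the vocab.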

Before that, you should pull the latest changes from main and rebase your branch :)

@sarjil77
Contributor Author

sarjil77 commented Dec 6, 2024

@felixdittrich92 The Gujarati letters in your commits cover a significant portion of the Gujarati alphabet but are not complete: 2 major vowels and 3 consonants are missing, so 5 letters in total. Based on the diacritics I provided earlier you are right, but these 5 letters are additional ones without diacritics, so I missed them (sorry). They are frequently used, so we can't ignore them. I would also strongly recommend keeping vowels and consonants separate: it is better for traceability and may help with tracing issues in the future.
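
One way to keep the groups separate while still exposing a combined entry is sketched below. `VOCAB_PARTS` is a hypothetical stand-in for the real vocab mapping, and the letters are a small illustrative subset, not the full alphabet discussed here.

```python
# Illustrative subsets only; the full letter lists are not reproduced here.
VOCAB_PARTS = {
    "gujarati_vowels": "\u0a85\u0a86\u0a87",      # LETTER A, AA, I
    "gujarati_consonants": "\u0a95\u0a96\u0a97",  # LETTER KA, KHA, GA
}

# Compose the combined entry from the parts so each group stays traceable.
VOCAB_PARTS["gujarati_letters"] = (
    VOCAB_PARTS["gujarati_vowels"] + VOCAB_PARTS["gujarati_consonants"]
)

print(len(VOCAB_PARTS["gujarati_letters"]))
```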

@felixdittrich92
Contributor

felixdittrich92 commented Dec 6, 2024

@felixdittrich92 The Gujarati letters in your commits cover a significant portion of the Gujarati alphabet but are not complete: 2 major vowels and 3 consonants are missing, so 5 letters in total. Based on the diacritics I provided earlier you are right, but these 5 letters are additional ones without diacritics, so I missed them (sorry). They are frequently used, so we can't ignore them. I would also strongly recommend keeping vowels and consonants separate: it is better for traceability and may help with tracing issues in the future.

Then I would say feel free to add the missing ones. What you see is your vocab, but deduplicated :)
I simply did "".join(sorted(set(VOCABS["gujarati_letters"]))) to filter them.
Also, splitting them would be fine on my end 👍🏼
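
The deduplication step can be sketched as follows. The function names and the sample string are assumptions for illustration; the second variant is an alternative that keeps first-seen order instead of sorting.

```python
def dedupe_sorted(chars: str) -> str:
    """Unique code points in code point order (set, then sorted, then join)."""
    return "".join(sorted(set(chars)))


def dedupe_keep_order(chars: str) -> str:
    """Unique code points in first-seen order; dict keys preserve insertion order."""
    return "".join(dict.fromkeys(chars))


# KA, KA + VOWEL SIGN AA, KA + VOWEL SIGN I: syllables that repeat the base consonant.
syllables = "\u0a95\u0a95\u0abe\u0a95\u0abf"

print(dedupe_sorted(syllables))
print(dedupe_keep_order(syllables))
```

Either way, the repeated consonant collapses to one entry and each dependent sign is kept once as its own symbol.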

@felixdittrich92
Contributor

@sarjil77 Please don't forget to rebase first. In the meantime I added a test case that checks the VOCABS entry values for duplicates :)
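
A duplicate check of that kind might look roughly like the sketch below. This is not the test that was added to the repository; `SAMPLE_VOCABS` and `find_duplicates` are hypothetical names, and the "broken" entry is deliberately faulty sample data.

```python
from collections import Counter

# Hypothetical stand-in for the real VOCABS mapping.
SAMPLE_VOCABS = {
    "digits": "0123456789",
    "broken": "abca",  # "a" appears twice on purpose
}


def find_duplicates(vocabs: dict[str, str]) -> dict[str, list[str]]:
    """Return, per vocab name, the characters that occur more than once."""
    report = {}
    for name, chars in vocabs.items():
        dupes = sorted(ch for ch, n in Counter(chars).items() if n > 1)
        if dupes:
            report[name] = dupes
    return report


duplicates = find_duplicates(SAMPLE_VOCABS)
print(duplicates)
```

In a test suite, the check would assert that the report is empty for every vocab entry.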

@sarjil77
Contributor Author

sarjil77 commented Dec 6, 2024

@felixdittrich92 i think done from my side :) haa,

thanks.

@felixdittrich92
Contributor

@sarjil77 your branch still needs to be rebased (see, there is a conflicting file) :)

  1. Update your fork
  2. Check out main and pull
  3. Check out your branch and pull / rebase
  4. Then force-push your changes

Additionally, the docs entry is missing; take a look at my provided branch :)

@sarjil77
Contributor Author

sarjil77 commented Dec 6, 2024

i think now it is good. :)

@felixdittrich92
Contributor

@sarjil77 it's still not rebased on main :)
See:

This branch has conflicts that must be resolved
Use the web editor or the command line to resolve conflicts.
Conflicting files
doctr/datasets/vocabs.py

And about the documentation entry: if you have added more chars, I think the number and char string have changed too ;) so please fix this :D

felixdittrich92 and others added 4 commits December 10, 2024 00:24
changes with vocab and documentation dec 10
Bumps the github-actions group with 1 update: [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action).


Updates `JamesIves/github-pages-deploy-action` from 4.7.1 to 4.7.2
- [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases)
- [Commits](JamesIves/github-pages-deploy-action@v4.7.1...v4.7.2)

---
updated-dependencies:
- dependency-name: JamesIves/github-pages-deploy-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
updating both vocab and documentation
@sarjil77
Contributor Author

sarjil77 commented Dec 9, 2024

Hey @felixdittrich92, I think I have changed both the documentation and the vocab. Can you please look into it? From my side, I have checked for any conflicts. :)

@felixT2K
Contributor

@sarjil77 Looks like you merged the main branch into your feature branch instead of rebasing your branch on main 😅
See: https://github.com/mindee/doctr/pull/1811/files (it includes some previously merged commits)

2 options:

  • Rebase your branch on main: git checkout <YOUR_FEATURE_BRANCH> -> git rebase main (preferred option)
  • Close this PR, create a new feature branch, add your changes, and open a fresh PR

👍🏼

@sarjil77
Contributor Author

Okay, I am closing this PR and will do as you suggested.
How can I be so stupid? Let me correct it. I thought it was okay, sorry. :D

@sarjil77 sarjil77 closed this Dec 11, 2024
@felixdittrich92
Contributor

Okay, I am closing this PR and will do as you suggested. How can I be so stupid? Let me correct it. I thought it was okay, sorry. :D

Don't worry, those are little things you will grow from, believe me :) Without getting things wrong we wouldn't learn ^^

4 participants